
fix(registration): jitter cooldown exit and rate-limit registration retries#860

Merged
andrewazores merged 4 commits into cryostatio:main from andrewazores:registration-herd
May 4, 2026

Conversation

@andrewazores
Member

Based on #858
Depends on #858
See #851

Adds two more behaviours:

  1. adds jitter to the cooldown time so that if multiple Agent instances enter failure cooldown around the same time, they don't all exit cooldown at the same moment and flood the Cryostat server. This can happen if the Cryostat server itself has failed, for example.
  2. adds a retry rate limit on registration so that the Agent will not re-attempt registration too rapidly, even if it has been pinged by the Cryostat server asking it to refresh its registration.
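The two behaviours above can be sketched roughly as follows. This is a minimal illustration, not the actual Cryostat Agent code; the class and parameter names (`RegistrationBackoff`, `jitterFactor`, `minRetryInterval`) are hypothetical:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of jittered cooldown plus a minimum retry interval.
public class RegistrationBackoff {
    private final Duration baseCooldown;
    private final double jitterFactor; // e.g. 0.25 => +/-25% around the base
    private final Duration minRetryInterval;
    private long lastAttemptNanos = Long.MIN_VALUE;

    public RegistrationBackoff(
            Duration baseCooldown, double jitterFactor, Duration minRetryInterval) {
        this.baseCooldown = baseCooldown;
        this.jitterFactor = jitterFactor;
        this.minRetryInterval = minRetryInterval;
    }

    /** Cooldown with random jitter so instances don't all exit cooldown in lockstep. */
    public Duration jitteredCooldown() {
        double scale = 1.0 + ThreadLocalRandom.current().nextDouble(-jitterFactor, jitterFactor);
        return Duration.ofNanos((long) (baseCooldown.toNanos() * scale));
    }

    /**
     * Rate limit: returns false (skip the attempt) if the previous registration
     * attempt was less than minRetryInterval ago, even if the server pinged us.
     */
    public synchronized boolean tryAcquire(long nowNanos) {
        if (lastAttemptNanos != Long.MIN_VALUE
                && nowNanos - lastAttemptNanos < minRetryInterval.toNanos()) {
            return false;
        }
        lastAttemptNanos = nowNanos;
        return true;
    }

    public static void main(String[] args) {
        RegistrationBackoff backoff =
                new RegistrationBackoff(Duration.ofSeconds(30), 0.25, Duration.ofSeconds(10));
        System.out.println("next cooldown: " + backoff.jitteredCooldown());
    }
}
```

With a 30s base and 25% jitter, each instance draws a cooldown somewhere in [22.5s, 37.5s), which spreads the herd's re-registration attempts out over a window instead of a single instant.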

@andrewazores andrewazores force-pushed the registration-herd branch 2 times, most recently from 7293cdf to f3c4af4 Compare May 1, 2026 14:07
@andrewazores andrewazores marked this pull request as ready for review May 1, 2026 15:17
@andrewazores andrewazores requested a review from a team May 1, 2026 15:17
Member

@jtolentino1 jtolentino1 left a comment


LGTM from my testing.

I tested the newer integrated images (cryostat-agent-init:registration-herd-6 and cryostat:4.2.0-registration-herd-5) on OpenShift with 22 injected Agent replicas. The 30+ minute soak stayed stable with 22 ready pods and 22 Cryostat targets, and aliases/connectUrls matched the live Agent pods. Scaling 22 -> 12 -> 22 also converged and Cryostat tracked the instances correctly.

I also tested the registration behavior directly. Repeated refresh pings returned 204, but the Agent logged the minimum-interval skips instead of rapidly re-registering, and the credential id stayed unchanged. After killing Cryostat with `kill 1`, several Agents entered cooldown with different jittered durations around the 30s base, and the system recovered back to 22/22 targets after about 3 minutes.

For the Cryostat-side changes, I saw the restart path using periodic discovery jobs with no old discovery.startup jobs left, and the new fault-tolerance rate limits fired for the registration/credential paths during recovery.

One note: after restart/recovery I did see stale discovery.periodic Quartz jobs logging Plugin not found, and the DB had more periodic jobs/credentials than live plugins, but the visible target state recovered correctly.

@andrewazores
Member Author

Thanks for the detailed analysis @jtolentino1 !

> One note: after restart/recovery I did see stale discovery.periodic Quartz jobs logging Plugin not found, and the DB had more periodic jobs/credentials than live plugins, but the visible target state recovered correctly.

This is "expected" in the current server-side implementation - when the job next runs, it'll cancel its own trigger if it detects that the Target it's set up for has disappeared. After a few minutes the persisted periodic jobs state in the database should settle back to 1:1 with the discovered targets once the system has made a full recovery.

https://github.com/cryostatio/cryostat/blob/8fced699fe4def3aa5fbd95fdc16ce18dabc2789/src/main/java/io/cryostat/discovery/Discovery.java#L974
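The self-cancelling behaviour described above could look something like the sketch below. The real server uses Quartz jobs; this illustration uses a plain ScheduledExecutorService instead, and all names (`SelfCancellingJob`, `liveTargets`, `targetId`) are hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a periodic job that cancels its own trigger
// once it observes that the target it was created for has disappeared.
public class SelfCancellingJob {
    public static ScheduledFuture<?> schedule(
            ScheduledExecutorService scheduler,
            Set<String> liveTargets,
            String targetId,
            Runnable work,
            long periodMillis) {
        ScheduledFuture<?>[] handle = new ScheduledFuture<?>[1];
        handle[0] =
                scheduler.scheduleAtFixedRate(
                        () -> {
                            if (!liveTargets.contains(targetId)) {
                                // Target gone: cancel our own trigger so the stale
                                // job stops firing. Persisted job state settles back
                                // to 1:1 with the live targets over the next ticks.
                                handle[0].cancel(false);
                                return;
                            }
                            work.run();
                        },
                        periodMillis,
                        periodMillis,
                        TimeUnit.MILLISECONDS);
        return handle[0];
    }
}
```

This matches the recovery pattern reported in testing: stale jobs log a miss once, cancel themselves, and the database state converges a few minutes after the targets do.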

Credentials should also eventually settle back to 1:1, but it's not critical if there are stale Credentials left around with 0 matching targets. If that is the case then that's another bug we should fix, but I think that can wait.

@andrewazores andrewazores merged commit 7f6af28 into cryostatio:main May 4, 2026
9 checks passed
@andrewazores andrewazores deleted the registration-herd branch May 4, 2026 20:30
